Network file deduplication using decaying bloom filters

ABSTRACT

A system for receiving and deduplicating data strings transmitted over a network is disclosed. The system comprises one or more network sensors detecting data strings while in transit on the network; and non-transitory memory comprising instructions. When the instructions are executed by one or more processors, the one or more processors establish a plurality of Bloom filters, receive a first data string, perform a first insertion operation into each Bloom filter; determine, for each of one or more Bloom filters, a set of bits, whether presently set or cleared, to be unset; and unset each determined set of bits in the one or more Bloom filters. At a later moment in time, the first data string is received again, and each Bloom filter is queried to determine whether the first data string has been inserted, based on a current state of that Bloom filter.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of and claims priority to U.S. patent application Ser. No. 17/503,252, filed Oct. 15, 2021, and also titled “NETWORK FILE DEDUPLICATION USING DECAYING BLOOM FILTERS”, which is hereby incorporated by reference in its entirety. This application also has a sister application that claims priority to the same parent and has been filed on the same day, Oct. 5, 2023, also entitled “NETWORK FILE DEDUPLICATION USING DECAYING BLOOM FILTERS”.

FIELD OF INVENTION

This disclosure relates to systems and methods for deduplication and caching, and more specifically, to systems and methods for increasing efficiency of high-volume caching of previously unseen files or other data strings, through use of multiple non-persistent Bloom filters.

BACKGROUND

With ever-evolving malware campaigns targeting various networks, institutions, and businesses, there is a perpetual need to track what files are being sent through networks and respond as quickly as possible when a file is being propagated through a network and may contain malware. Because static and dynamic analysis of potential malware is too computationally expensive to perform on every instance of every file sent, there is an evident benefit to analyzing only the first instance of a file being seen, storing that analysis, and analyzing subsequent files only if an analysis has not already been performed on another copy of that file. Such a method requires some form of data structure to track these past sightings for purposes of deduplication. Similar use cases exist for systems that must cache a large number of static files, like a web crawler used by a search engine, or for other deduplication in general.

One common data structure for storing whether a file has been seen before is the Bloom filter. Traditional Bloom filters are a variant of hash table where insertion involves generating K hashes of an input (for some K greater than one), and setting K corresponding bits of the hash table to true. The item itself is not stored in the Bloom filter; only a subset of bits not permitting reconstruction of the item are affected. To look up whether an item has been inserted, the K hashes are re-generated for that item and each corresponding bit of the table is checked. If all K bits are set, the item is assumed to have been inserted.
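By way of non-limiting illustration, the insertion and lookup operations of a traditional Bloom filter might be sketched as follows. The class name, the method names, and the choice of salted SHA-256 hashing are assumptions made for this example only and are not drawn from any particular embodiment.

```python
import hashlib


class BloomFilter:
    """Plain (non-decaying) Bloom filter with M bits and K salted hashes."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, item: bytes) -> list[int]:
        # Assumption: K hashes are obtained by prefixing K one-byte salts.
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + item).digest(), "big") % self.m
            for i in range(self.k)
        ]

    def insert(self, item: bytes) -> None:
        # Only K bits are touched; the item itself is never stored.
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item: bytes) -> bool:
        # All K bits set means "assumed inserted" (possibly a false positive).
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.insert(b"invoice.pdf")
print(bf.might_contain(b"invoice.pdf"))  # True
```

Because bits are never cleared in this traditional form, an inserted item is always reported as present.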

However, there is always a possibility of a false positive, as multiple items may have been inserted whose hashes collectively overlap with all K hashes of a not-inserted item. The initial probability of a false positive can be controlled before runtime by the choice of K or the choice of the number of bits M in the table, and the probability of a false positive increases at runtime as the table becomes more and more saturated with set bits. The Bloom filter is considered fully saturated when enough bits have been set that the probability of any given file receiving a false positive exceeds an acceptable threshold rate. Adding additional Bloom filters or increasing the size of existing Bloom filters reduces saturation temporarily, but if the volume of file insertions remains high for a sustained period of time, it is not a feasible long-term solution.
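The relationship between saturation and false positives can be made concrete with a short calculation. The sketch below assumes idealized, uniformly distributed hashes, so that a never-inserted item is falsely reported as present only when all K of its probed bits happen to be set; the function name is hypothetical.

```python
def false_positive_rate(saturation: float, k: int) -> float:
    """Chance that a never-inserted item passes the check in one filter.

    saturation: fraction of bits currently set (0.0 to 1.0).
    Assumes uniform, independent hash positions, which is an idealization.
    """
    return saturation ** k  # all K probed bits happen to be set


for s in (0.35, 0.50, 0.90, 0.99):
    print(f"saturation {s:.2f}, K=23: {false_positive_rate(s, 23):.3g}")
```

Under this model the rate stays negligible at moderate saturation and climbs steeply as the fraction of set bits approaches one.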

As a result, there are advantages to developing systems with Bloom filters that can be used for a longer period of time, or indefinitely, and with a greater number of insertions without becoming fully saturated and returning unacceptable false positives in response to queries.

SUMMARY OF THE INVENTION

A system for receiving and deduplicating data strings transmitted over a network is disclosed. The system comprises one or more network sensors detecting data strings while in transit on the network; and non-transitory memory comprising instructions. When the instructions are executed by one or more processors, the one or more processors establish a plurality of Bloom filters, receive a first data string, perform a first insertion operation into each Bloom filter; determine, for each of one or more Bloom filters, a set of bits, whether presently set or cleared, to be unset; and unset each determined set of bits in the one or more Bloom filters. At a later moment in time, the first data string is received again, and each Bloom filter is queried to determine whether the first data string has been inserted, based on a current state of that Bloom filter.

Similarly, a computer-implemented method for receiving and deduplicating data strings transmitted over a network is disclosed. The method comprises establishing a plurality of Bloom filters; receiving a first data string; generating a set of distinct hashes of the first data string; performing a first insertion operation into each Bloom filter of the plurality of Bloom filters; determining, for each of one or more Bloom filters of the plurality of Bloom filters, a set of bits, whether presently set or cleared, to be unset; and unsetting each determined set of bits in the one or more Bloom filters. At a later moment in time, the method continues by receiving the first data string again and querying each Bloom filter of the plurality of Bloom filters to determine whether the first data string has been inserted, based on a current state of that Bloom filter.

Additional features include variations of the above system and method where each Bloom filter has bits unset simultaneously in a staggered sweeping pattern, where Bloom filters take turns being the only one to have bits unset, and where none of the Bloom filters is an authoritative filter that stores all the set bits that other filters may be missing. The bit unsetting process, bit selection process, Bloom filter selection process, and/or decay trigger may be based on the level of saturation of the Bloom filters, be based on the passage of time, be constant, be dynamic, be random, occur upon every insertion or a threshold of insertion counts, or be a combination of these factors.

BRIEF DESCRIPTION OF THE DRAWINGS

Other aspects, features and advantages will become more fully apparent from the following detailed description, the appended claims, and the accompanying drawings (provided solely for purposes of illustration without restricting the scope of any embodiment), of which:

FIG. 1 illustrates, in simplified form, a system of network sensors and computing devices used to track files being transmitted through a network;

FIG. 2 illustrates, in simplified form, a method for checking whether a file has been seen before and inserting it into or updating it within a set of Bloom filters;

FIG. 3 illustrates, in simplified form, a set of Bloom filters after a first file insertion according to the method illustrated by FIG. 2;

FIG. 4 illustrates, in simplified form, the set of Bloom filters from FIG. 3 after a series of insertions and a sweeping reset of bits according to the method illustrated by FIG. 2;

FIG. 5 illustrates, in simplified form, an alternative method for checking whether a file has been seen before and inserting it into or updating it within a set of Bloom filters;

FIG. 6 illustrates, in simplified form, a set of decaying Bloom filters after a first file insertion according to the method illustrated by FIG. 5;

FIG. 7 illustrates, in simplified form, the set of decaying Bloom filters from FIG. 6 after a series of insertions and an alternating decay of bits according to the method illustrated by FIG. 5;

FIG. 8 depicts a graph displaying the experimental results of insertions in the first style of staggered sweeping bit change in a three-Bloom filter system;

FIG. 9 depicts a graph displaying experimental results of insertions in the second style of alternating decay in a five-Bloom filter system; and

FIG. 10 is a high-level block diagram of a representative computing device that may be utilized to implement various features and processes described herein.

DETAILED DESCRIPTION

The issue of Bloom filters becoming oversaturated over a period of time may be addressed by creating a system in which multiple Bloom filters are used simultaneously, and over time, differing subsets of bits in each Bloom filter are permitted to “decay” and are unset back to a false value. When the system performs a deduplication check to see if a file has been previously inserted, there is a query performed within each Bloom filter or within a selected subset of the Bloom filters (such as the set of filters that are not currently undergoing decay when decay alternates between filters), and the overall system returns a “true” on the insertion check if any of the Bloom filters queried reported a “true.” The decaying process may ultimately result, given enough time, in a false negative result for every item that has previously been inserted into the Bloom filters without being refreshed by repeated sightings. However, this trade-off of a small increase in false negatives for a massive decrease in false positives, achieved without infinitely expanding the size of the Bloom filters, can have benefits in specific use cases.

For example, malware analysis on all files transmitted over a network is a scenario where there will be a sustained, extremely high volume of insertions and deduplication checks, but if a file is ever transmitted on the network again, it is most likely to happen within the first few days after its first insertion (when an email is forwarded, or when a download link has been sent to multiple recipients and each recipient downloads it shortly after receipt). After the initial burst of transmissions or downloads, the new file is relatively unlikely to be seen again. If the new file does happen to be seen again after that initial period of time, it may be an acceptable outcome to re-perform any computation associated with a “new” file, or at least to send the file to a second level of deduplication that is more rigorous and slower than the Bloom filter lookup. As used throughout this disclosure, a “false positive” is a response to the question “Has this file been encountered before?” with “Yes” when the file is entirely novel, and a “false negative” is a response to the question “Has this file been encountered before?” with “No” when the file has previously been received.

When bits are periodically unset in the Bloom filters being used for deduplication, it is possible to avoid full saturation. As a result, when a new file comes along that has never been seen before, the probability of a false positive due to saturation and subsequent ignoring of the file is minimized. If a file comes along which has already been seen, during the initial minutes, hours, or days since the last time it was seen (when it is most likely to be seen again), a false negative for that file will be least likely and the file will not be needlessly re-analyzed, and only later, after the file is not likely to be seen again, will bits be unset and increase the chance that a false negative occurs. In the framework of malware analysis, a false positive is much more damaging than a false negative, since a false negative leads to re-analysis of an already encountered file and waste of resources, but a false positive leads to throwing away a file the first time it is encountered, when its information might have been sorely needed to analyze as a possible threat.

Unlike a traditional Bloom filter, a countably infinite number of insertions can be performed without the false positive rate approaching 100%, at the cost of a probabilistic chance that a false negative will occur, but with the guarantee that a false negative will not occur for at least a minimum number of insertions or a minimum window of time. Also, unlike traditional multi-filter setups, no filter is defined as the unique authoritative filter that has a superset of all other filters' bits and is used to confirm a “not before seen” determination when a Bloom filter with fewer set bits fails to confirm a sighting. Instead, each filter has different, partially overlapping sets of bits that have been changed to true, and each is consulted and given equal weight.

FIG. 1 illustrates, in simplified form, a system of network sensors and computing devices used to track files being transmitted through a network.

With reference now to FIG. 1, a number of network sensors 100 may be distributed at the edges of or internal to some form of data network 105. The network 105 may be any form of wired or wireless network, including a LAN, WLAN, VPN, ethernet, portion of the Internet, etc. The network sensors 100 are capable of intercepting communications between computers 110 outside the network and computers 115 inside the network, or between two or more computers 115 inside the network. The network sensors 100 are, in a preferred embodiment, low-latency routers or network taps that make a copy of network traffic before forwarding the packets to their destination, though in other embodiments, they may be off-the-shelf routers configured to run additional custom software, or may even be general purpose computing devices or servers.

In some embodiments, related specifically to malware analysis, one or more of the network sensors 100 may be in communication with a database 120 (for caching files if the network sensor 100 determines the file has not been seen before, or for acting as a second round of deduplication if the network sensor wants to confirm that the file has not been seen before) and/or an analysis system 125. Upon receiving a new file, the network sensor 100 may transmit the file to the database 120 or the analysis system 125, so that the database 120 may update its cache to include the file, and so that the analysis system 125 may begin performing static or dynamic analysis upon it as possible malware (for example, checking the file for suspicious substrings or running the file in a sandbox environment to determine its behavior).

The network sensors 100 and analysis system 125 may have connections to one or more external computing systems via the network 105 or other networks, for various purposes such as notifying human users or third party systems that a file has been seen, that one or more analyses has been performed, and/or what the results of the analyses were.

Although a particular division of functions between devices is described above with relation to the systems depicted in FIG. 1, other configurations are possible in which functions are divided among devices differently. For example, all of the functions of some or all of a network sensor 100, the database 120, and the analysis system 125 may be performed by a single device with multiple threads executing different software modules simultaneously.

Alternatively, the database 120 and/or analysis system 125 may in fact be a cluster of computing devices sharing functionality for concurrent processing. Further, although these various computing elements are described as if they are one computing device or cluster each, a cloud-based solution with multiple access points to similar systems that synchronize their data and are all available as backups to one another may be preferable in some embodiments to a unique set of computing devices all stored at one location. The specific number of computing devices and whether communication between them is network transmission between separate computing devices or accessing a local memory of a single computing device is not so important as the functionality that each part has in the overall scheme.

Running on each of the network sensors 100 is the software for file deduplication that informs the later analysis of the file or other use of the file. FIG. 2 illustrates, in simplified form, a method for checking whether a file has been seen before and inserting it into or updating it within a set of Bloom filters.

First, at least two Bloom filters are instantiated (Step 200) to record instances of files being observed in the network 105. In some embodiments, only two Bloom filters are created, while in others, three, four, five, or even more Bloom filters may be instantiated, depending on desired trade-offs of size, speed, and accuracy. Note that, while in all preferred embodiments multiple Bloom filters are established, it is possible for the principles described herein to be applied to a single Bloom filter, though the advantages of maintaining multiple Bloom filters with different sets of bits will naturally be lost if only one Bloom filter is utilized.

Next, a network sensor 100 directly observes or otherwise receives a file (Step 205) for which a determination should be made whether the file has previously been observed or received.

Upon receiving the file, for some predetermined value K, K distinct hashes of the file are generated (Step 210). In a preferred embodiment, a same hash function, such as MD5 or SHA-256, is performed to obtain each of the K distinct hashes, and the distinct values are the result of appending K distinct salts to the file before each hashing. In other embodiments, the K distinct hashes may be the result of using K distinct hash functions, or by performing K distinct transformations, other than salting, to the file before evaluating the hash function. Any particular configuration of functions and transformations may be used, so long as they deterministically generate K distinct values that are intended to be randomly distributed throughout a range of equal or greater size than that of each Bloom filter. In a preferred embodiment, K may be set to 23.

If the range of values from the hash is greater than M, the number of bits in the Bloom filters, the hashes are normalized (Step 215) to the range of 1 to M, preferably by taking each hash value modulo M (i.e., h % M for each of the K hash values h).
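Steps 210 and 215 might be sketched as follows, purely as an illustration. The function name, the four-byte big-endian salt encoding, and the convention that a remainder of zero is mapped to position M (so that positions fall in the range 1 to M) are assumptions of this example.

```python
import hashlib


def bit_positions(data: bytes, k: int, m: int) -> list[int]:
    """Return K bit positions in the range 1..M for a file's contents."""
    positions = []
    for salt in range(k):
        # Step 210: append a distinct salt, then apply the same hash function.
        digest = hashlib.sha256(data + salt.to_bytes(4, "big")).digest()
        value = int.from_bytes(digest, "big")
        # Step 215: reduce modulo M; a remainder of 0 is treated as position M.
        positions.append(value % m or m)
    return positions


print(bit_positions(b"example file contents", k=23, m=1_000_000))
```

The 256-bit hash values are distinct for distinct salts with overwhelming probability, although two of them may still reduce to the same position after the modulo step.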

Each Bloom filter is checked at each of the normalized hash values (h % M) to see if each such bit is already set (Step 220). If they are all set within any of the Bloom filters, the method records that the file is assumed to have been seen before (Step 225); otherwise, when every Bloom filter has at least one of the bits unset, the method records that the file is assumed not to have been seen before (Step 230).

Regardless of the status of each of the bits and the overall determination, the bits at each of the hash values are set to true in each of the Bloom filters (Step 235). In an alternative embodiment, the bits may only be set in a selected subset of the Bloom filters, so that some of the Bloom filters remain unchanged after an insertion.
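The check and insertion of Steps 220 through 235 might be combined as in the following sketch. The helper `positions` and the function `check_and_insert` are hypothetical names, and zero-based bit positions are used here for brevity.

```python
import hashlib


def positions(data: bytes, k: int, m: int) -> list[int]:
    # Hypothetical helper: K salted SHA-256 hashes reduced to 0..M-1.
    return [
        int.from_bytes(hashlib.sha256(data + bytes([i])).digest(), "big") % m
        for i in range(k)
    ]


def check_and_insert(filters: list[bytearray], data: bytes, k: int) -> bool:
    """Return True if the file is assumed seen before, then record it."""
    pos = positions(data, k, len(filters[0]))
    # Steps 220-230: "seen" if ANY filter has all K bits set.
    seen = any(all(f[p] for p in pos) for f in filters)
    # Step 235: set the bits in every filter regardless of the outcome.
    for f in filters:
        for p in pos:
            f[p] = 1
    return seen


filters = [bytearray(4096) for _ in range(3)]
print(check_and_insert(filters, b"report.docx", k=5))  # False: first sighting
print(check_and_insert(filters, b"report.docx", k=5))  # True: bits now set
```

Setting the bits on every sighting is what refreshes a frequently seen file against later decay.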

Next, the current saturation of the Bloom filters is checked (Step 240). If the saturation does not exceed a predetermined threshold, the prior determination is used as the basis for any necessary further action (Step 245). For example, a determination that a file has not been seen may result in transmission of the file to the database 120 for long term storage; transmission of the file to the analysis system 125 for static and/or dynamic malware analysis; transmission of a digest or alert to a system monitoring network traffic; communication with a human user to alert that user to the situation; or any other automated or human-assisted response. Afterward, the network sensor 100 returns to waiting for another file to be observed or processing the next already-observed file waiting in a queue (back to Step 205).

If the saturation does exceed the predetermined threshold, or if no threshold is explicitly set, one or more of the Bloom filters each sweep a certain number of bits from a region of the Bloom filter and unset each of them (Step 250). In a preferred embodiment, the decay strategy involves sweeping a same number of bits from each of the Bloom filters, though in other strategies, only a subset of the filters may be swept at a time, or unequal numbers of bits may be swept from each filter. This process may also be incorporated into a database trigger that is executed in response to each insertion, as opposed to some software or database process that is independently executed.
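One possible reading of a threshold-triggered sweep (Steps 240 and 250) for a single filter is sketched below. The function name and the choice to stop as soon as saturation reaches the threshold are assumptions of this example, not requirements of the method.

```python
def sweep_until_below(bits: bytearray, pointer: int, threshold: float) -> int:
    """Unset bits starting at `pointer` until saturation <= threshold.

    Returns the new pointer position. Gives up after one full pass so the
    loop always terminates.
    """
    m = len(bits)
    set_count = sum(bits)
    steps = 0
    while set_count / m > threshold and steps < m:
        if bits[pointer]:
            bits[pointer] = 0
            set_count -= 1
        pointer = (pointer + 1) % m
        steps += 1
    return pointer


bits = bytearray(b"\x01" * 10)       # fully saturated 10-bit filter
pointer = sweep_until_below(bits, 0, threshold=0.5)
print(pointer, sum(bits))            # 5 5
```

If the saturation is already at or below the threshold, the function leaves the filter and the pointer unchanged.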

The number of bits may be equal to K, be equal to a fraction of K such as K/2 or 0.9K, be equal to another fixed relation with K such as (K−1) or (K−2), or may be dynamically set to however many bits would need to be unset to reduce saturation below the predetermined threshold. The region of each sweep is different in each of the Bloom filters, as illustrated in FIGS. 3 and 4 and discussed further below. In a preferred embodiment, the pointers for the sweeps are kept equidistant from each other so that, for example, in a two Bloom filter system the pointers are always M/2 bits apart, or in a three Bloom filter system the pointers are always M/3 bits apart. In other embodiments, the pointers may be permitted to update independent of one another.
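The staggered sweep with equidistant pointers might be sketched as follows. The class name `StaggeredSweep`, the starting offsets, and the use of zero-based positions are assumptions made for illustration.

```python
class StaggeredSweep:
    """N equally sized filters, each with its own sweep pointer (Step 250)."""

    def __init__(self, n_filters: int, m: int) -> None:
        self.m = m
        self.filters = [bytearray(m) for _ in range(n_filters)]
        # Pointers start M/N bits apart and advance in lockstep.
        self.pointers = [(i * m) // n_filters for i in range(n_filters)]

    def sweep(self, count: int) -> None:
        """Unset `count` consecutive bits in every filter, wrapping around."""
        for i, f in enumerate(self.filters):
            start = self.pointers[i]
            for offset in range(count):
                f[(start + offset) % self.m] = 0
            self.pointers[i] = (start + count) % self.m


s = StaggeredSweep(n_filters=3, m=27)
for f in s.filters:
    f[:] = b"\x01" * 27           # pretend every bit is set
s.sweep(3)
print(s.pointers)                 # [3, 12, 21]
```

Because every pointer advances by the same amount, the spacing of M/N bits between pointers is preserved, and each filter loses a different region on each sweep.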

The size of the sweep, if fixed rather than dynamic, may be set to achieve a particular desired saturation level for the Bloom filters. Due to the inherent randomness of the Bloom filter data structure, the sweep may reduce saturation back to the threshold, or may cause the saturation to decrease well below the threshold (for example, if an insertion exclusively set bits that were already set, yet K bits were immediately unset), or may not decrease saturation at all (if all the bits in the sweep happened to not have been set yet). However, the total saturation will stabilize at around the threshold over time as overcorrections and undercorrections cancel one another out. If a fixed number of bits are unset after each insertion, the saturation of the Bloom filters will stabilize probabilistically at a given level even if the sweep is being performed blindly without checking the saturation of the Bloom filters.
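The stabilizing behavior of a blind, fixed-size sweep can be explored with a small simulation. The sketch below models a single filter, uses pseudo-random positions as a stand-in for real hash values, and picks its parameters arbitrarily; it is a toy model rather than a reproduction of any experiment described herein.

```python
import random


def simulate(m: int, k: int, sweep: int, insertions: int, seed: int = 0) -> float:
    """Insert random items, sweeping `sweep` bits after each; return saturation."""
    rng = random.Random(seed)
    bits = bytearray(m)
    pointer = 0
    for _ in range(insertions):
        # Stand-in for K uniformly distributed hash positions.
        for p in rng.sample(range(m), k):
            bits[p] = 1
        # Blind sweep: unset the next `sweep` bits without checking saturation.
        for offset in range(sweep):
            bits[(pointer + offset) % m] = 0
        pointer = (pointer + sweep) % m
    return sum(bits) / m


print(f"{simulate(m=2000, k=5, sweep=5, insertions=20_000):.2f}")
```

In this toy model with the sweep size equal to K, a rough estimate suggests the saturation plateaus somewhere in the mid-30% range; larger sweeps relative to K lower the plateau and smaller sweeps raise it.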

In this method, the step of insertion is described as preceding the sweep. However, the sweep could instead be ordered before the insertion/check step. Further, the insertion/check and the sweep could be decoupled, such that the sweep occurs periodically on a predetermined schedule, rather than in response to a trigger called during the insertion/check. If the two actions are decoupled, careful consideration would be required to ensure that the scheduled frequency of sweeps is sufficient to keep the saturation sufficiently low, but also does not unnecessarily destroy information if the Bloom filter bits are not refreshed by a continued stream of insertions.

After the sweep has been performed, any final actions are undertaken (Step 245), including reporting the outcome of the check for prior insertion and transmitting the file to another destination. Afterward, the network sensor 100 returns to waiting for another file to be observed or processing the next already-observed file waiting in a queue (back to Step 205).

For purposes of clarifying how the method of FIG. 2 works in practice, FIG. 3 illustrates, in simplified form, a set of Bloom filters after a first file insertion according to the method illustrated by FIG. 2.

In one example embodiment, simplified for the sake of explanation, three Bloom filters 300 a, 300 b, 300 c might be established, each with 26 bits 305 a-305 z, 310 a-310 z, and 315 a-315 z. Each Bloom filter 300 has a persistent pointer 320 a, 320 b, 320 c to a particular bit 305, 310, 315 that will be unset upon the next insertion, and the pointer 320 advanced afterward. In the depicted example, the first item inserted has three hashes determined that correspond to the third, fifth, and seventh bits (that is, F₁(x) % 26=3, F₂(x) % 26=5, and F₃(x) % 26=7), so three bits in each Bloom filter 305 c, 305 e, 305 g, 310 c, 310 e, 310 g, 315 c, 315 e, 315 g are set to true (shown by depicting each bit in black).

FIG. 4 illustrates, in simplified form, the set of Bloom filters from FIG. 3 after a series of insertions and a sweeping reset of bits according to the method illustrated by FIG. 2.

For the sake of depicting features of the method, let there be three subsequent items whose hashes, after reduction mod 26, are (4, 10, 18), (5, 25, 26), and (9, 10, 11). After insertion of all four items, twelve hash positions have been written, covering ten distinct bits (bits 5 and 10 are each hit twice), so up to 10 bits might be set in each of the Bloom filters 300, though in some instances in each table, bits previously set have been unset as the pointers 320 advance. Further, in some instances, a previously unset bit is re-set after a subsequent insertion.

After three more insertions, the pointers 320 may have each swept forward by a total of 9 bits (3 bits per insertion, in this example), in the process resetting some of the bits that had been previously set, as shown with hatching marks in FIG. 4.

FIG. 5 illustrates, in simplified form, an alternative method for checking whether a file has been seen before and inserting it into or updating it within a set of Bloom filters.

As in the method of FIG. 2, first, at least two Bloom filters are instantiated (Step 500) to record instances of files being observed in the network 105. In some embodiments, only two Bloom filters are created, while in others, three, four, five, or even more Bloom filters may be instantiated, depending on desired trade-offs of size, speed, and accuracy. Similarly, a network sensor 100 directly observes or otherwise receives a file (Step 505) for which a determination should be made whether the file has previously been observed or received, K distinct hashes of the file are generated by whatever method is preferred (Step 510), and the hashes are normalized to the range of 1 to M (Step 515).

Again, each Bloom filter is checked at each of the normalized hash values (h % M) to see if each such bit is already set (Step 520). If they are all set within any of the Bloom filters, the method records that the file is assumed to have been seen before (Step 525); otherwise, when every Bloom filter has at least one of the bits unset, the method records that the file is assumed not to have been seen before (Step 530). Regardless of the status of each of the bits and the overall determination, the bits at each of the hash values are set to true in each of the Bloom filters (Step 535).

Next, the current saturation of the Bloom filters may be checked (Step 540). If the saturation does not exceed a predetermined threshold, the method ends with the prior determination being used as the basis for any further action (Step 545), as previously described above. Afterward, the network sensor 100 returns to waiting for another file to be observed or processing the next already-observed file waiting in a queue (back to Step 505).

If the saturation does exceed the predetermined threshold, or if no threshold is explicitly set, only one of the Bloom filters is selected to have a certain number of bits from a region of the Bloom filter unset (Step 550). In one embodiment, a random process is used such that the last Bloom filter is again targeted a certain proportion of the time (such as 99% of the time) and in the remainder of cases, the Bloom filter to be targeted rotates to the next Bloom filter in the set. In another embodiment, the Bloom filter to be targeted may be randomly chosen every time.

Because a single Bloom filter's bits are being unset instead of each Bloom filter simultaneously, the number of bits to be unset in the chosen Bloom filter should be scaled upward in order to achieve a desired average saturation of the Bloom filters as a whole, based on how the random process functions. For example, if there are N Bloom filters, the number of bits to unset may be equal to K*N, be equal to a fraction of K*N such as K*N/2 or 0.9*K*N, be equal to another fixed relation with K*N such as (K*N−1) or (K*N−2), or may be dynamically set to however many bits would need to be unset to reduce saturation below the predetermined threshold.

In contrast to the staggered regions of sweep in the method of FIGS. 2-4, the region of each sweep may be contiguous across all the Bloom filters, as illustrated in FIGS. 6 and 7 and discussed further below. In a preferred embodiment, a single pointer is maintained across all the Bloom filters to decide where to unset the next bits. In other embodiments, multiple pointers may be used that are independent of one another.
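The alternating decay of Step 550, with a single shared pointer and a target filter that usually stays the same, might be sketched as follows. The class name, the `stay_prob` parameter, and the seeding are assumptions of this example.

```python
import random


class AlternatingDecay:
    """N filters sharing one sweep pointer; one filter decays at a time."""

    def __init__(self, n_filters: int, m: int, stay_prob: float = 0.99,
                 seed: int = 0) -> None:
        self.m = m
        self.filters = [bytearray(m) for _ in range(n_filters)]
        self.pointer = 0      # single pointer shared by all filters
        self.target = 0       # index of the filter currently decaying
        self.stay_prob = stay_prob
        self.rng = random.Random(seed)

    def decay(self, count: int) -> int:
        """Unset `count` bits in one filter and return that filter's index."""
        # Keep the same target most of the time; otherwise rotate to the next.
        if self.rng.random() >= self.stay_prob:
            self.target = (self.target + 1) % len(self.filters)
        f = self.filters[self.target]
        for offset in range(count):
            f[(self.pointer + offset) % self.m] = 0
        self.pointer = (self.pointer + count) % self.m
        return self.target


k, n = 3, 3
d = AlternatingDecay(n_filters=n, m=26, stay_prob=1.0)  # 1.0: never rotate
for f in d.filters:
    f[:] = b"\x01" * 26
d.decay(k * n)                      # sweep size scaled up to K*N
print([sum(f) for f in d.filters])  # [17, 26, 26]
```

Only the targeted filter loses bits on a given call, which is why the sweep size is scaled by N to keep the average saturation comparable to the staggered approach.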

As long as the size of the sweep is properly calibrated, the total saturation across all Bloom filters should stabilize at around a desired threshold over time. However, compared to the method of FIGS. 2-4, the saturation may be much more erratic across the Bloom filters, as depicted in FIG. 9 and discussed below.

Again, in this method, the step of insertion is described as preceding the sweep. However, the sweep could instead be ordered before the insertion/check step, or decoupled, such that the sweep occurs periodically on a predetermined schedule.

After the sweep has been performed, any final actions are undertaken (Step 545), including reporting the outcome of the check for prior insertion and transmitting the file to another destination. Afterward, the network sensor 100 returns to waiting for another file to be observed or processing the next already-observed file waiting in a queue (back to Step 505).

For purposes of clarifying how the method of FIG. 5 works in practice, FIG. 6 illustrates, in simplified form, a set of decaying Bloom filters after a first file insertion according to the method illustrated by FIG. 5.

As was depicted in FIG. 3, after a single insertion, three bits may be set in each of the Bloom filters 300. However, in this case, only a single pointer 320 is used and is shared between the three filters 300.

FIG. 7 illustrates, in simplified form, the set of decaying Bloom filters from FIG. 6 after a series of insertions and an alternating decay of bits according to the method illustrated by FIG. 5.

After the same series of files as depicted in FIG. 4 are inserted, the third Bloom filter 300 c has not yet had any of its bits unset. In contrast, the first Bloom filter 300 a has had two sets of bits unset, with a greater amount unset than in FIG. 4, and for the third insertion, the pointer continues advancing, but now acting on the randomly chosen second Bloom filter 300 b. The overall saturation of all three filters together is roughly the same as in FIG. 4, but now an unpredictable number of bits may be set in any particular filter.

FIG. 8 depicts a graph displaying the experimental results of insertions in one style of sweeping bit unsetting, in a three-Bloom filter system, according to the method of FIGS. 2-4.

FIG. 8 depicts the not-before-seen correctness rate 800 (i.e., 100% minus the false positive rate), the previously-seen correctness rate 805 (i.e., 100% minus the false negative rate), and the saturations 810 of three Bloom filters over time as a series of insertions and queries are performed.
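The two correctness rates can be computed from simple counts, as in the sketch below. The function and parameter names are hypothetical, and the sample counts are invented for illustration and are not taken from the experiments shown in the figures.

```python
def correctness_rates(novel_total: int, novel_flagged_seen: int,
                      repeat_total: int, repeat_flagged_new: int) -> tuple[float, float]:
    """Return (not-before-seen, previously-seen) correctness in percent."""
    # False positive: a novel file wrongly reported as seen before.
    false_positive_rate = 100.0 * novel_flagged_seen / novel_total
    # False negative: a repeated file wrongly reported as new.
    false_negative_rate = 100.0 * repeat_flagged_new / repeat_total
    return 100.0 - false_positive_rate, 100.0 - false_negative_rate


# Hypothetical counts for illustration only.
print(correctness_rates(1000, 80, 500, 25))  # (92.0, 95.0)
```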

Experimental performance shows that the use of three Bloom filters inthe method of FIGS. 2-4 can result in a sustained false positive ratebelow 10%, even after a number of insertions that would, in a single,non-decaying Bloom filter, result in a nearly 100% false positive rate.Although the false negative rate is more erratic than the false positiverate (compare lines 800 and 805 in FIG. 8 ), a false negative is not asmuch of an issue during a deduplication process since it only results inwasted computation, not loss of information.

Note also that, since each Bloom filter is swept and has bits unsetafter each insertion, the saturations 810 are virtuallyindistinguishable from one another at any given moment, in contrast tothe saturations 910 seen in FIG. 9 . Even without a deliberate attemptto maintain a particular saturation level, the balance between new bitsset and old bits swept and unset will quickly reach an equilibrium—here,at about 35% of bits set, with each of the three Bloom filters having adifferent set of 35% bits set at any given time.

Even if an infinite number of insertions are performed, the false negative rate, false positive rate, and overall saturation of the Bloom filters will probabilistically remain stable. There may be instances where all newly-set bits correspond to bits that were already set (so that the insertion and unsetting sweep together result in a net decrease in set bits), or where the sweep corresponds solely to bits that were already unset (so that the insertion and sweep together result in a net increase in set bits), but these processes will tend to asymptotically cancel out at a stable saturation rate, and a stable false positive and false negative rate as a function of the saturation.
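The balance between setting and sweeping can be illustrated with a minimal sketch of a sweeping multi-filter structure. Everything specific here is an assumption made for illustration and not the parameters of the experiment in FIG. 8: the class name `SweepingBloomFilters`, the filter size, the number of hashes, the sweep width, the evenly staggered starting pointers, and the use of SHA-256 to derive bit positions.

```python
import hashlib


class SweepingBloomFilters:
    """Several Bloom filters that each unset a run of bits after every insertion."""

    def __init__(self, num_filters=3, num_bits=1024, num_hashes=4, sweep_width=8):
        # num_hashes must be at most 8, since each position uses 4 bytes
        # of a 32-byte SHA-256 digest (an illustrative hashing choice).
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.sweep_width = sweep_width
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.filters = [bytearray(num_bits) for _ in range(num_filters)]
        # Assumed staggering: pointers start evenly spaced around the filter.
        self.pointers = [(i * num_bits) // num_filters for i in range(num_filters)]

    def _positions(self, data):
        digest = hashlib.sha256(data).digest()
        return [
            int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits
            for i in range(self.num_hashes)
        ]

    def insert(self, data):
        positions = self._positions(data)
        for index, bits in enumerate(self.filters):
            # The same bits are set in every filter.
            for position in positions:
                bits[position] = 1
            # Then each filter unsets sweep_width bits at its own pointer,
            # whether those bits are presently set or cleared.
            start = self.pointers[index]
            for offset in range(self.sweep_width):
                bits[(start + offset) % self.num_bits] = 0
            self.pointers[index] = (start + self.sweep_width) % self.num_bits

    def query(self, data):
        # The string counts as seen if any one filter still has all its bits set.
        positions = self._positions(data)
        return any(all(bits[p] for p in positions) for bits in self.filters)

    def saturations(self):
        return [sum(bits) / self.num_bits for bits in self.filters]
```

Calling `insert` repeatedly with distinct strings and printing `saturations()` every few hundred insertions should show the three values leveling off near one another rather than climbing toward 100%. The level they settle at depends on the assumed parameters and is not expected to match the 35% figure reported above.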

FIG. 9 depicts a graph displaying experimental results of insertions in one style of alternating decay in a five-Bloom filter system, according to the method of FIGS. 5-7.

FIG. 9 depicts the not-before-seen correctness rate 900 (i.e., 100% minus the false positive rate), the previously-seen correctness rate 905 (i.e., 100% minus the false negative rate), and the saturations 910a, 910b, 910c, 910d, 910e of five Bloom filters over time as a series of insertions and queries are performed. As can be seen in contrast with FIG. 8, the saturations 910 vary wildly between the five Bloom filters, as only one Bloom filter is being unset at any given moment in time, while the other four continue to fill up. For example, between two moments 915a and 915b, one saturation 910c consistently decreases, as it is repeatedly the one Bloom filter from which bits are being selected for unsetting, until at moment 915b a new Bloom filter is selected and another saturation 910e begins to decrease instead.
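A single unsetting step of this alternating style might be sketched as follows. The function name `alternating_unset`, the shared pointer, and the `base_width` parameter are assumptions for illustration. The width is multiplied by the number of filters so that one filter's loss roughly offsets the bits set in all of them. How long the same filter stays selected before a new one is chosen, as between moments 915a and 915b, is left to the caller.

```python
import random


def alternating_unset(filters, pointer, base_width, rng):
    """Unset a run of bits in one randomly chosen filter.

    `filters` is a list of equal-length bytearrays holding one byte per bit.
    Returns (chosen_index, new_pointer).
    """
    num_bits = len(filters[0])
    chosen = rng.randrange(len(filters))
    # Scale the run by the filter count, capped at the filter size.
    width = min(base_width * len(filters), num_bits)
    for offset in range(width):
        # Bits are unset whether presently set or cleared.
        filters[chosen][(pointer + offset) % num_bits] = 0
    # The pointer continues advancing past the run that was unset.
    return chosen, (pointer + width) % num_bits
```

With three fully set 64-bit filters and a `base_width` of 4, one call clears 12 bits in the chosen filter, leaves the other two untouched, and advances the pointer by 12 positions, wrapping around the end of the filter if needed.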

In the test results depicted, false negatives almost never occur (only briefly at moment 915c) and other than that brief burst, the false negative rate remains below 1% throughout the experiment. However, the false positive rate is erratic, exceeding 30% at times in this example. Accordingly, a staggered sweep that affects all Bloom filters (as in FIGS. 2-4 and 7) may be preferable in many cases to the alternating decay depicted in FIG. 9, since false negatives are generally preferred to false positives in the intended application. In embodiments where a higher false negative rate is acceptable and a higher false positive rate is not, or other considerations are in play, parameters may be set differently, and the numbers of bits, their locations, and their allocations between Bloom filters may be chosen differently.

Although FIG. 1 depicts a preferred configuration of computing devices and software modules to accomplish the software-implemented methods described above, those methods do not inherently rely on the use of any particular specialized computing devices, as opposed to standard desktop computers and/or web servers. For the purpose of illustrating such computing devices, FIG. 10, below, describes various enabling devices and technologies related to the physical components and architectures described above.

FIG. 10 is a high-level block diagram of a representative computing device that may be utilized to implement various features and processes described herein, for example, the functionality of the network sensors 100, the database 120, the analysis system 125, or any other computing device described. The computing device may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types.

As shown in FIG. 10, the computing device is illustrated in the form of a special purpose computer system. The components of the computing device may include (but are not limited to) one or more processors or processing units 1000, a system memory 1010, and a bus 1015 that couples various system components including memory 1010 to processor 1000.

Bus 1015 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Processing unit(s) 1000 may execute computer programs stored in memory 1010. Any suitable programming language can be used to implement the routines of particular embodiments including C, C++, Java®, Python®, assembly language, etc. Different programming techniques can be employed such as procedural or object oriented. The routines can execute on a single computing device or multiple computing devices. Further, multiple processors 1000 may be used.

The computing device typically includes a variety of computer system readable media. Such media may be any available media that is accessible by the computing device, and it includes both volatile and non-volatile media, removable and non-removable media.

System memory 1010 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 1020 and/or cache memory 1030. The computing device may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 1040 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically referred to as a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 1015 by one or more data media interfaces. As will be further depicted and described below, memory 1010 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments described in this disclosure.

Program/utility 1050, having a set (at least one) of program modules 1055, may be stored in memory 1010 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment.

The computing device may also communicate with one or more external devices 1070 such as a keyboard, a pointing device, a display, etc.; one or more devices that enable a user to interact with the computing device; and/or any devices (e.g., network card, modem, etc.) that enable the computing device to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interface(s) 1060.

In addition, as described above, the computing device can communicate with one or more networks, such as a local area network (LAN), a general wide area network (WAN) and/or a public network (e.g., the Internet) via network adaptor 1080. As depicted, network adaptor 1080 communicates with other components of the computing device via bus 1015. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with the computing device. Examples include (but are not limited to) microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.

The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette or thumb drive, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may use copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It is understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks. The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

What is claimed is:
 1. A system for receiving and deduplicating data strings transmitted over a network, comprising: one or more sensors detecting data strings while in transit on the network; and non-transitory memory comprising instructions that, when executed by one or more processors, cause the one or more processors to: establish a plurality of Bloom filters; receive a first data string; perform a first insertion operation inserting the first data string into each Bloom filter of the plurality of Bloom filters by setting same bits in each Bloom filter of the plurality of Bloom filters; select, for one Bloom filter of the plurality of Bloom filters chosen at least in part by a random number generation, a set of bits, whether presently set or cleared, to be unset, the set of bits having a magnitude proportionally scaled by a factor equal to a count of Bloom filters in the plurality of Bloom filters, such that saturation of the plurality of Bloom filters as a whole remains constant after setting bits in all Bloom filters and unsetting bits in the one Bloom filter; unset the selected set of bits in the one Bloom filter; receive the first data string again at a later moment in time; and query each Bloom filter of the plurality of Bloom filters to determine whether the first data string has been inserted, based on a current state of that Bloom filter.
 2. The system of claim 1, wherein in response to receiving a second data string, a different one Bloom filter of the plurality of Bloom filters, chosen at least in part by a random number generation, has a different set of bits selected to be unset.
 3. The system of claim 1, further comprising an analysis system to which the first data string is sent for analysis as possible malware at the later moment in time if none of the Bloom filters of the plurality of Bloom filters responds to the query by indicating the data string has been inserted.
 4. The system of claim 1, wherein a number of bits in one or more Bloom filters of the plurality of Bloom filters to be unset is selected based at least in part on intrinsic properties of each Bloom filter, the intrinsic properties including at least one of number of hashes generated or a size of hashes generated.
 5. The system of claim 1, wherein a number of bits in one or more Bloom filters of the plurality of Bloom filters to be unset is selected to ensure that at least a predetermined number of already set bits will be unset.
 6. The system of claim 1, wherein bits are selected for unsetting to maintain an invariant property that no Bloom filter from the plurality of Bloom filters has set bits that are a strict superset of another Bloom filter's set bits, despite each insertion operation having set the same bits in each Bloom filter from the plurality of Bloom filters.
 7. The system of claim 1, wherein sets of bits are unset periodically in the plurality of Bloom filters prior to the later moment in time.
 8. The system of claim 1, wherein the selected set of bits is selected deterministically, based on one of a plurality of advancing indices, each advancing index indicating a location in its respective Bloom filter at which to begin unsetting bits, and each advancing index advancing to a position beyond the selected set of bits that was unset, after the unsetting has occurred.
 9. The system of claim 1, wherein the selected set of bits is selected randomly from among all bits in the one Bloom filter.
 10. A computer-implemented method for receiving and deduplicating data strings transmitted over a network, comprising: establishing a plurality of Bloom filters; receiving a first data string; generating a set of distinct hashes of the first data string; performing a first insertion operation inserting the first data string into each Bloom filter of the plurality of Bloom filters by setting same bits in each Bloom filter of the plurality of Bloom filters; selecting, for one Bloom filter of the plurality of Bloom filters chosen at least in part by a random number generation, a set of bits, whether presently set or cleared, to be unset, the set of bits having a magnitude proportionally scaled by a factor equal to a count of Bloom filters in the plurality of Bloom filters, such that saturation of the plurality of Bloom filters as a whole remains constant after setting bits in all Bloom filters and unsetting bits in the one Bloom filter; unsetting the selected set of bits in the one Bloom filter; receiving the first data string again at a later moment in time; and querying each Bloom filter of the plurality of Bloom filters to determine whether the first data string has been inserted, based on a current state of that Bloom filter.
 11. The method of claim 10, wherein in response to receiving a second data string, a different one Bloom filter of the plurality of Bloom filters, chosen at least in part by a random number generation, has a different set of bits determined to be unset.
 12. The method of claim 10, further comprising an analysis system to which the first data string is sent for analysis as possible malware at the later moment in time if none of the Bloom filters of the plurality of Bloom filters responds to the query by indicating the data string has been inserted.
 13. The method of claim 10, wherein a number of bits in one or more Bloom filters of the plurality of Bloom filters to be unset is selected based at least in part on intrinsic properties of each Bloom filter, the intrinsic properties including at least one of number of hashes generated or a size of hashes generated.
 14. The method of claim 10, wherein a number of bits in one or more Bloom filters of the plurality of Bloom filters to be unset is selected to ensure that at least a predetermined number of already set bits will be unset.
 15. The method of claim 10, wherein bits are selected for unsetting to maintain an invariant property that no Bloom filter from the plurality of Bloom filters has set bits that are a strict superset of another Bloom filter's set bits, despite each insertion operation having set the same bits in each Bloom filter from the plurality of Bloom filters.
 16. The method of claim 10, wherein sets of bits are unset periodically in the plurality of Bloom filters prior to the later moment in time.
 17. The method of claim 10, wherein the selected set of bits is selected deterministically, based on one of a plurality of advancing indices, each advancing index indicating a location in its respective Bloom filter at which to begin unsetting bits, and each advancing index advancing to a position beyond the selected set of bits that was unset, after the unsetting has occurred.
 18. The method of claim 10, wherein the selected set of bits is selected randomly from among all bits in the one Bloom filter.